Abstract

Conventional vision algorithms adopt a single type of feature or a
simple concatenation of multiple features, which is usually represented
in a high-dimensional space. In this paper, we propose a novel
unsupervised spectral embedding algorithm called Kernelized Multiview
Projection (KMP) to better fuse and embed different feature
representations. Computing the kernel matrices from different
features/views, KMP encodes them with the corresponding weights to
obtain a low-dimensional and semantically meaningful subspace where the
distribution of each view is sufficiently smooth and discriminative.
More crucially, KMP is linear in the reproducing kernel Hilbert space
(RKHS) and solves the out-of-sample problem, which makes it suitable
for various practical applications. Extensive experiments on three
popular image datasets demonstrate the effectiveness of our multiview
embedding algorithm.

1. Introduction 

Traditional computer vision techniques are mainly based on single
feature representations, either global [21] or local [7]. For local
methods, descriptors such as SIFT [17] are computed for each detected
or densely sampled point, then the Bag-of-Words scheme or an improved
version of it is employed to embed these local features into a whole
representation. Local-feature-based methods tend to be more robust and
effective in challenging scenarios, but this kind of representation is
often not precise and informative because of the quantization error
during the codebook construction and the loss of structural
relationships among local features. Global representations [18, 10], on
the other hand, describe the image as a whole. Unfortunately, global
methods are sensitive to shift, scaling, occlusion and clutter, which
commonly exist in realistic images.

Notwithstanding the remarkable results achieved by both local and
global methods in some cases, most of them are still based on a single
view (feature representation). In realistic applications, variations in
lighting conditions, intra-class differences, complex backgrounds, and
viewpoint and scale changes all hinder robust feature extraction.
Naturally, single representations cannot handle realistic tasks to a
satisfactory extent.

In practice, a typical sample can be represented by different
views/features, e.g., gradient, shape, color, texture and motion.
Generally speaking, these views from different feature spaces maintain
their particular statistical characteristics. Accordingly, it is
desirable to incorporate these heterogeneous feature descriptors into
one compact representation, leading to the multiview learning
approaches. These techniques have been designed for multiview data
classification [29], clustering [6] and feature selection [28]. For
such multiview learning tasks, the feature representations are usually
very high-dimensional for each view. However, little effort has been
devoted to learning low-dimensional and compact representations for
multiview computer vision tasks. Thus, how to obtain an effective
low-dimensional embedding that discovers the discriminative information
from all views is a worthy research topic, since the effectiveness and
efficiency of the methods drop exponentially as the dimensionality
increases, which is commonly referred to as the curse of
dimensionality.

Existing multiview embedding techniques include the multiview spectral
embedding (MSE) [24] and the multiview stochastic neighbor embedding
(m-SNE) [27], which have explored the locality information and the
probability distributions, respectively, for the fusion of multiview
data. Recently, Han et al. [12] proposed a sparse unsupervised
dimensionality reduction to obtain a sparse representation for
multiview data. However, these methods are only defined on the training
data and it remains unclear how to embed new test data due to their
nonlinearity. In other words, they suffer from the out-of-sample
problem [3], which heavily restricts their applicability in realistic
and large-scale vision tasks.

In this paper, to tackle the out-of-sample problem, we propose a novel
unsupervised multiview subspace learning method called kernelized
multiview projection (KMP), which learns a projection that encodes
different features with different weights to achieve a semantically
meaningful embedding. KMP considers different probabilistic
distributions of data points and the locality information among data
simultaneously. Different from the measurement of locality information
in the locality preserving projections (LPP) [26] and the locally
linear embedding (LLE) [20], an $\ell_1$-graph [9, 15] is applied to
generate the similarity matrix, which is shown to be more robust to
data noise and automatically sparse. Moreover, the $\ell_1$-graph can
also adaptively discover the natural neighborhood information for each
data point.

Instead of using the multiview features directly, the kernel matrices
from multiple views enable KMP to normalize the scales and the
dimensions of different features. In fact, we show that the fusion of
multiple kernels is actually the concatenation of features in the
high-dimensional reproducing kernel Hilbert space (RKHS), while the
learning phase of KMP remains in the low-dimensional space. Having
obtained kernels for each view in RKHS, KMP can not only fuse the views
by exploring the complementary property of different views as in
multiple kernel learning (MKL) [14, 11, 23], but also find a common
low-dimensional subspace where the distribution of each view is
sufficiently smooth and discriminative. Note that multiview learning
techniques are used to fuse different views/features while MKL is used
to combine different kernel functions.

The rest of this paper is organized as follows. In Section 
2, we give a brief review of the related work. The details 
of our method are described in Section 3. Section 4 reports 
the experimental results. Finally, we conclude this paper in 
Section 5. 

2. Related Work 

A simple multiview embedding framework is to concatenate the feature
vectors from different views together as a new representation and
apply an existing dimensionality reduction method directly to the
concatenated vector to obtain the final multiview representation.
Nonetheless, this kind of concatenation is not physically meaningful
because each view has a specific characteristic. Moreover, the
relationship between different views is ignored and the complementary
nature of the intrinsic data structure of different views is not
sufficiently explored.

One feasible solution is proposed in [16], namely, distributed spectral
embedding (DSE). For DSE, a spectral embedding scheme is first
performed on each view separately, producing the individual
low-dimensional representations. After that, a common compact embedding
is learned so that it is as similar as possible to all of the
single-view representations. Although the spectral structure of each
view can be effectively considered for learning a multiview embedding
via DSE, the complementarity between different views is still
neglected.

To effectively and efficiently learn the complementary nature of
different views, multiview spectral embedding (MSE) is introduced in
[24]. The main advantage of MSE is that it can simultaneously learn a
low-dimensional embedding over all views rather than separate learning
as in DSE. Additionally, MSE shows better effectiveness in fusing
different views in the learning phase.

However, both DSE and MSE are based on nonlinear embedding, which leads
to high computational complexity and the out-of-sample problem [3]. In
particular, when we apply them to classification or retrieval tasks,
the methods have to be re-trained to learn the low-dimensional
embedding whenever new test data are used. Due to their nonlinear
nature, this incurs heavy computational costs and even becomes
impractical for realistic and large-scale scenarios.

Towards solving the out-of-sample problem for multiview embedding, we
propose an unsupervised projection method, namely, KMP. It is
noteworthy that, as a linear method, a projection is learned via the
proposed KMP using all of the training data. Different from nonlinear
approaches, however, once the learning phase finishes, the projection
is fixed and can be directly applied to embed any new test sample
without re-training.

3. Kernelized Multiview Projection 

3.1. Notations 

Given $N$ training samples $\{S_1, \dots, S_N\}$ and $M$ different
descriptors for multiview feature extraction, $X_p^i \in
\mathbb{R}^{D_i}$ represents the feature vector for the $i$-th view and
$p$-th sample. Since the dimensions of various descriptors are
different, kernel matrices $K_1, \dots, K_M \in \mathbb{R}^{N \times N}$
are constructed by kernel functions such as the RBF kernel and the
polynomial kernel, for the fusion of different views on the same scale.
Our task is to output an optimal projection matrix $P \in
\mathbb{R}^{N \times d}$ and weights $(\alpha_1, \dots, \alpha_M)$
satisfying $\sum_{i=1}^{M} \alpha_i = 1$ for the kernel matrices such
that the fused feature matrix $Y = [y_1, \dots, y_N]^T = KP =
(\sum_{i=1}^{M} \alpha_i K_i)P$ can represent the original multiview
data comprehensively.
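Because the embedding has the form Y = KP, an unseen sample can be embedded by evaluating its kernel values against the training samples and multiplying by P. The following is a minimal sketch of that step, not the authors' implementation: the function names, the list-based data layout and the RBF parametrization exp(-||x - z||^2 / (2 sigma^2)) are assumptions made for illustration.

```python
import math

def rbf(x, z, sigma):
    """RBF kernel value between two feature vectors (lists of floats).

    The exp(-||x - z||^2 / (2 sigma^2)) form is an assumed parametrization.
    """
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def embed_new_sample(views_new, views_train, alpha, sigmas, P):
    """Embed one unseen sample as y = P^T k, with k the fused kernel vector.

    views_new[i]       : feature vector of the new sample in view i
    views_train[i][p]  : feature vector of training sample p in view i
    alpha[i], sigmas[i]: weight and (assumed) RBF bandwidth of view i
    P                  : N x d projection matrix given as a list of rows
    """
    n = len(P)
    d = len(P[0])
    # Fused kernel vector: k_p = sum_i alpha_i * k_i(x_new^i, x_p^i).
    k = [sum(a * rbf(views_new[i], views_train[i][p], s)
             for i, (a, s) in enumerate(zip(alpha, sigmas)))
         for p in range(n)]
    # Linear projection of the kernel vector: y_j = sum_p k_p * P[p][j].
    return [sum(k[p] * P[p][j] for p in range(n)) for j in range(d)]
```

No re-training is involved: the cost of embedding one sample is M kernel evaluations per training sample plus one N x d matrix-vector product.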

3.2. Formulation of KMP 

The projection learning of KMP is based on the similarity matrix $W_i$
for the $i$-th view, $i = 1, 2, \dots, M$. For each view, we measure
the similarity of each sample pair by using the neighbors of each
point. The construction of $W_i$ is illustrated below via the
$\ell_1$-graph [9], which has been demonstrated to be robust to data
noise, automatically sparse and adaptive to the neighborhood.


Similarity construction For each $X_p^i$, we find the coefficients
$\beta \in \mathbb{R}^{N-1}$ such that $X_p^i = B\beta$, where $B =
[X_1^i, \dots, X_{p-1}^i, X_{p+1}^i, \dots, X_N^i] \in
\mathbb{R}^{D_i \times (N-1)}$. Considering the noise effect, we can
rewrite it as $X_p^i = B'\beta'$, where $B' = [B, I] \in
\mathbb{R}^{D_i \times (D_i+N-1)}$ and $\beta' \in
\mathbb{R}^{D_i+N-1}$. Thus, seeking the sparse representation for
$X_p^i$ leads to the following optimization problem:

$$\arg\min_{\beta'} \|X_p^i - B'\beta'\|_2, \quad \text{s.t. } \|\beta'\|_1 < \varepsilon, \tag{1}$$

where $\varepsilon$ is the parameter with a small value. This problem
can be solved by the orthogonal matching pursuit [19].

Considering different probabilistic distributions that exist over the
data points and the natural locality information of the data, we first
employ the Gaussian mixture model (GMM) on the training data for each
view. On the one hand, it has been observed that data in the
high-dimensional space do not always follow the same distribution, but
are naturally clustered into several groups. On the other hand,
realistic data distributions basically follow the same form, i.e., the
Gaussian distribution. In this case, $G$ clusters are obtained by the
unsupervised GMM clustering for each view. Thus, we can solve the above
problem (1) using only the data from the same cluster to represent each
point rather than all the data points in $B$, which also alleviates the
computational complexity of problem (1).

In particular, for $\beta' = (\beta_1, \dots, \beta_{D_i+N-1})$, we can
first set $\beta_q = 0$ if $X_q^i$ and $X_p^i$ are in different
clusters, $\forall q \neq p$, then solve the above problem. Now the
similarity matrix $W_i \in \mathbb{R}^{N \times N}$ can be defined as:
$(W_i)_{pp} = 0$, $\forall p$, $(W_i)_{pq} = |\beta_q|$ if $q < p$, and
$(W_i)_{pq} = |\beta_{q-1}|$ if $q > p$. To ensure the symmetry, we
update $W_i \leftarrow (W_i + W_i^T)/2$. Then we set the diagonal
matrix $D_i \in \mathbb{R}^{N \times N}$ with $(D_i)_{pp} = \sum_q
(W_i)_{pq}$ and the Laplacian matrix $L_i = D_i - W_i$ for each view
$i$.


Multiview kernel fusion Due to the complementary nature of different
descriptors, we assign different weights to different views. The goal
of KMP is to find the basis of a subspace in which the
lower-dimensional representation can preserve the intrinsic structure
of the original data. Therefore, we impose a set of nonnegative weights
$\alpha = (\alpha_1, \dots, \alpha_M)$ on the similarity matrices
$W_1, \dots, W_M$ and we have the fused similarity matrix $W =
\sum_{i=1}^{M} \alpha_i W_i$, the fused diagonal matrix $D =
\sum_{i=1}^{M} \alpha_i D_i$ and the fused Laplacian matrix $L = D - W
= \sum_{i=1}^{M} \alpha_i L_i$.

For the kernel matrix, we also define the fused kernel matrix $K =
\sum_{i=1}^{M} \alpha_i K_i$. In fact, suppose $\phi_i$ is the
underlying feature map for kernel $K_i$, i.e., $K_i =
\phi_i(X^i)^T\phi_i(X^i)$, then the fused kernel value is computed by
the feature vector concatenated by the mapped vectors via $\phi_1,
\dots, \phi_M$, since we have

$$K = \sum_{i=1}^{M} \alpha_i K_i = \sum_{i=1}^{M} \alpha_i \phi_i(X^i)^T \phi_i(X^i)
= \begin{bmatrix} \sqrt{\alpha_1}\,\phi_1(X^1) \\ \vdots \\ \sqrt{\alpha_M}\,\phi_M(X^M) \end{bmatrix}^T
\begin{bmatrix} \sqrt{\alpha_1}\,\phi_1(X^1) \\ \vdots \\ \sqrt{\alpha_M}\,\phi_M(X^M) \end{bmatrix}
= \phi(X)^T \phi(X),$$

where $\phi(\cdot) = [\sqrt{\alpha_1}\,\phi_1(\cdot)^T, \dots,
\sqrt{\alpha_M}\,\phi_M(\cdot)^T]^T$ is the fused feature map and $X =
(X^1, \dots, X^M)$ is the $M$-tuple consisting of features from all the
views.
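The identity between the weighted kernel sum and the Gram matrix of concatenated, scaled features can be checked numerically when the feature maps are explicit. The sketch below uses linear kernels (so that each feature map is the identity) purely as an illustrative assumption; the helper names and toy data are hypothetical and are not taken from the paper.

```python
import math

def gram(features):
    """Linear-kernel Gram matrix of a list of feature vectors."""
    return [[sum(a * b for a, b in zip(x, z)) for z in features] for x in features]

def fuse_kernels(kernels, alpha):
    """Weighted sum K = sum_i alpha_i * K_i of equally sized kernel matrices."""
    n = len(kernels[0])
    return [[sum(a * K[p][q] for a, K in zip(alpha, kernels)) for q in range(n)]
            for p in range(n)]

def concat_scaled(views, alpha):
    """Concatenate the views of each sample after scaling view i by sqrt(alpha_i)."""
    n = len(views[0])
    return [[math.sqrt(a) * v for a, view in zip(alpha, views) for v in view[p]]
            for p in range(n)]

# Toy data (illustrative only): 3 samples, view 1 is 2-D, view 2 is 1-D.
views = [
    [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]],
    [[3.0], [1.0], [-2.0]],
]
alpha = [0.3, 0.7]

K_fused = fuse_kernels([gram(v) for v in views], alpha)
K_concat = gram(concat_scaled(views, alpha))
# Largest entrywise difference between the two constructions.
max_gap = max(abs(K_fused[p][q] - K_concat[p][q])
              for p in range(3) for q in range(3))
```

With an RBF kernel the feature maps are implicit, so only the left-hand construction (the weighted sum of kernel matrices) is computable in practice.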

To preserve the fused locality information, we need to find the optimal
projection for the following optimization problem:

$$\arg\min_{v} \sum_{p,q} \|v^T\psi_p - v^T\psi_q\|^2 (W)_{pq}, \tag{2}$$

where $\psi_p$ is the fused mapped feature, i.e., $[\psi_1, \dots,
\psi_N] = \phi(X)$. Through simple algebraic derivation, the above
optimization problem can be transformed into the following form:

$$\arg\min_{v} \mathrm{Tr}\left(v^T\phi(X)L\phi(X)^Tv\right). \tag{3}$$

With the constraint $\mathrm{Tr}(v^T\phi(X)D\phi(X)^Tv) = 1$,
minimizing the objective function in Eq. (3) amounts to solving the
following generalized eigenvalue problem:

$$\phi(X)L\phi(X)^Tv = \lambda\,\phi(X)D\phi(X)^Tv. \tag{4}$$


Note that each solution of problem (4) is a linear combination of
$\psi_1, \dots, \psi_N$, and there exists an $N$-tuple $p = (p_1,
\dots, p_N)^T \in \mathbb{R}^N$ such that $v = \sum_{i=1}^{N}
p_i\psi_i = \phi(X)p$. For the matrix $V$ consisting of all the
linearly independent solutions of problem (4), there exists a matrix
$P$ such that $V = \phi(X)P$. Therefore, with the additional constraint
$\mathrm{Tr}(V^T\phi(X)D\phi(X)^TV) = 1$ and $K = \phi(X)^T\phi(X)$, we
can formulate the new objective function as follows:

$$\arg\min_{P,\alpha} \mathrm{Tr}(P^TKLKP), \quad \text{s.t. } \mathrm{Tr}(P^TKDKP) = 1,\ \sum_{i=1}^{M}\alpha_i = 1,\ \alpha_i \ge 0, \tag{5}$$

or in the form associated with the norm constraint:

$$\arg\min_{P,\alpha} \frac{\mathrm{Tr}(P^TKLKP)}{\mathrm{Tr}(P^TKDKP)}, \quad \text{s.t. } \sum_{i=1}^{M}\alpha_i = 1,\ \alpha_i \ge 0. \tag{6}$$







3.3. Alternate Optimization via Relaxation 

In this section, we employ a procedure of alternate optimization [4] to
derive the solution of the optimization problem. To the best of our
knowledge, it is difficult to find its optimal solution directly,
especially for the weights in (6).

First, for a fixed $\alpha$, finding the optimal projection $P$ reduces
to solving the generalized eigenvalue problem

$$KLKp = \lambda KDKp, \tag{7}$$

and setting $P = [p_1, \dots, p_d]$ to the eigenvectors corresponding
to the smallest $d$ eigenvalues, based on the Ky-Fan theorem [5].
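The projection step for a fixed weight vector can be sketched with NumPy alone by whitening the right-hand matrix with a Cholesky factor. This is a sketch under stated assumptions rather than the authors' solver: the function name `solve_projection`, the small ridge `reg` that keeps K D K positive definite, and the toy data are all introduced here for illustration.

```python
import numpy as np

def solve_projection(K, L, D, d, reg=1e-8):
    """Return (P, eigenvalues) for K L K p = lambda K D K p, keeping the d
    eigenvectors with the smallest eigenvalues.

    reg is a small ridge added to K D K so that it is positive definite;
    this regularization is an assumption of the sketch, not part of Eq. (7).
    """
    A = K @ L @ K
    B = K @ D @ K + reg * np.eye(K.shape[0])
    A = (A + A.T) / 2.0           # remove round-off asymmetry
    B = (B + B.T) / 2.0
    C = np.linalg.cholesky(B)     # B = C C^T
    # Whitened symmetric problem: (C^-1 A C^-T) u = lambda u, with p = C^-T u.
    Ci_A = np.linalg.solve(C, A)
    Mw = np.linalg.solve(C, Ci_A.T).T
    Mw = (Mw + Mw.T) / 2.0
    eigvals, U = np.linalg.eigh(Mw)   # eigenvalues in ascending order
    P = np.linalg.solve(C.T, U[:, :d])
    return P, eigvals[:d]

# Toy inputs (illustrative only): an RBF kernel and a random symmetric graph.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
W = rng.random((6, 6))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W
P, lam = solve_projection(K, L, D, d=2)
```

The returned columns satisfy P^T (K D K) P = I up to the ridge; since the ratio in (6) is invariant to rescaling P, this normalization can be used in place of the unit-trace constraint.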

Next, to optimize $\alpha$, we derive a relaxed objective function from
the original problem. The output of the relaxed function can ensure
that the value of the objective function in (6) is in a small
neighborhood of the true minimum.

We fix the projection $P$ to update $\alpha$ individually. Without loss
of generality, we first consider the condition that $M = 2$, i.e.,
there are only two views. Then the optimization problem (6) is reduced
to

$$\arg\min_{\alpha_1,\alpha_2} \frac{\mathrm{Tr}(P^TKLKP)}{\mathrm{Tr}(P^TKDKP)}, \quad \text{s.t. } \alpha_1 + \alpha_2 = 1,\ \alpha_1, \alpha_2 \ge 0. \tag{9}$$

For simplicity, we denote $L_{ijk} = \mathrm{Tr}(P^TK_iL_kK_jP)$ and
$D_{ijk} = \mathrm{Tr}(P^TK_iD_kK_jP)$, $i, j, k \in \{1, 2\}$. Then we
can simply find that $L_{ijk} = L_{jik}$ and $D_{ijk} = D_{jik}$.


Relaxation With the Cauchy-Schwarz inequality [13], the relaxation for
the objective function in (9) is shown in Eq. (8), where $w_{ijk}$ is
the coefficient of $L_{ijk}/D_{ijk}$ and $\sum_{i,j,k \in \{1,2\}}
w_{ijk} = 1$. In this way, the objective function in (9) is relaxed to
a weighted sum of $L_{ijk}/D_{ijk}$. Thus, minimizing the weighted sum
on the right-hand side of (8) can lower the objective function value in
(9). Note that

$$\alpha_1^2\alpha_2 = \frac{1}{2}\,\alpha_1 \cdot \alpha_1 \cdot 2\alpha_2 \le \frac{1}{2}\left(\frac{\alpha_1 + \alpha_1 + 2\alpha_2}{3}\right)^3 = \frac{4}{27},$$

and then the weights that do not contain $\alpha_1^3$ and $\alpha_2^3$
are always smaller than a constant. Therefore, we only ensure that a
part of the terms in the weighted sum is minimized, i.e., we solve the
following optimization problem:

$$\arg\min_{\alpha_1,\alpha_2} w_{111}\frac{L_{111}}{D_{111}} + w_{222}\frac{L_{222}}{D_{222}}, \quad \text{s.t. } w_{111} + w_{222} = 1. \tag{10}$$
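The Cauchy-Schwarz step behind the relaxation states that, for positive terms a_t and b_t, the ratio of sums (sum a_t)/(sum b_t) is bounded above by the weighted sum of ratios sum w_t a_t/b_t with w_t = a_t / sum(a). The sketch below checks this numerically; the numbers are made up and merely stand in for the six numerator and denominator terms of the two-view expansion, and the function names are hypothetical.

```python
def relaxation_weights(a):
    """Weights w_t = a_t / sum(a) that multiply the ratios a_t / b_t."""
    total = sum(a)
    return [x / total for x in a]

def ratio_of_sums(a, b):
    """Left-hand side of the bound: (sum_t a_t) / (sum_t b_t)."""
    return sum(a) / sum(b)

def weighted_sum_of_ratios(a, b):
    """Right-hand side of the bound: sum_t w_t * a_t / b_t."""
    return sum(w * x / y for w, x, y in zip(relaxation_weights(a), a, b))

# Hypothetical positive terms (values are invented for illustration).
a = [0.8, 0.3, 0.2, 0.1, 0.4, 0.6]
b = [1.0, 0.9, 0.5, 0.7, 0.8, 1.2]

lhs = ratio_of_sums(a, b)
rhs = weighted_sum_of_ratios(a, b)
```

The weights sum to one by construction, which is why the right-hand side can be read as a convex combination of the per-term ratios.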

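Since $w_{111}$ and $w_{222}$ are functions of $(\alpha_1, \alpha_2)$,
we first find the optimal weights without the parameters $(\alpha_1,
\alpha_2)$. To avoid a trivial solution, we assign an exponent $r > 1$
to each weight. By denoting $\gamma_1 = w_{111}$ and $\gamma_2 =
w_{222}$, the relaxed optimization will be

$$\arg\min_{\gamma_1,\gamma_2} \gamma_1^r\frac{L_{111}}{D_{111}} + \gamma_2^r\frac{L_{222}}{D_{222}}, \quad \text{s.t. } \gamma_1 + \gamma_2 = 1,\ \gamma_1, \gamma_2 \ge 0. \tag{11}$$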

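For (11), we have the Lagrangian function with the Lagrangian
multiplier $\eta$:

$$\mathcal{L}(\gamma_1, \gamma_2, \eta) = \gamma_1^r\frac{L_{111}}{D_{111}} + \gamma_2^r\frac{L_{222}}{D_{222}} - \eta(\gamma_1 + \gamma_2 - 1). \tag{12}$$

We only need to set the derivatives of $\mathcal{L}$ with respect to
$\gamma_1$, $\gamma_2$ and $\eta$ to zero as follows:

$$\frac{\partial\mathcal{L}}{\partial\gamma_1} = r\gamma_1^{r-1}\frac{L_{111}}{D_{111}} - \eta = 0, \tag{13}$$

$$\frac{\partial\mathcal{L}}{\partial\gamma_2} = r\gamma_2^{r-1}\frac{L_{222}}{D_{222}} - \eta = 0, \tag{14}$$

$$\frac{\partial\mathcal{L}}{\partial\eta} = -(\gamma_1 + \gamma_2 - 1) = 0. \tag{15}$$

Then $\gamma_1$ and $\gamma_2$ can be calculated by

$$\gamma_1 = \frac{(L_{222}D_{111})^{\frac{1}{r-1}}}{(L_{222}D_{111})^{\frac{1}{r-1}} + (L_{111}D_{222})^{\frac{1}{r-1}}}, \qquad
\gamma_2 = \frac{(L_{111}D_{222})^{\frac{1}{r-1}}}{(L_{222}D_{111})^{\frac{1}{r-1}} + (L_{111}D_{222})^{\frac{1}{r-1}}}. \tag{16}$$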

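Having acquired $\gamma_1$ and $\gamma_2$, we can obtain $\alpha_1$ and
$\alpha_2$ by the corresponding relationship between the coefficients
of the functions in (10) and (11):

$$\frac{\alpha_1^3L_{111}}{\alpha_2^3L_{222}} = \frac{w_{111}}{w_{222}} = \frac{\gamma_1}{\gamma_2}. \tag{17}$$

With the constraint $\alpha_1 + \alpha_2 = 1$, we can easily find that

$$\alpha_1 = \frac{(\gamma_1L_{222})^{\frac{1}{3}}}{(\gamma_1L_{222})^{\frac{1}{3}} + (\gamma_2L_{111})^{\frac{1}{3}}}, \qquad
\alpha_2 = \frac{(\gamma_2L_{111})^{\frac{1}{3}}}{(\gamma_1L_{222})^{\frac{1}{3}} + (\gamma_2L_{111})^{\frac{1}{3}}}. \tag{18}$$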

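Hence, for the general $M$-view situation, we also have the
corresponding relaxed problems:

$$\arg\min_{\sum_{i=1}^{M}\alpha_i = 1,\ \alpha_i \ge 0}\ \sum_{i,j,k \in \{1,\dots,M\}} w_{ijk}(\alpha_1, \dots, \alpha_M)\frac{L_{ijk}}{D_{ijk}} \tag{19}$$

and

$$\arg\min_{\gamma}\ \sum_{i=1}^{M} \gamma_i^r\frac{L_{iii}}{D_{iii}}, \quad \text{s.t. } \sum_{i=1}^{M}\gamma_i = 1,\ \gamma_i \ge 0. \tag{20}$$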

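The coefficients $(\gamma_1, \dots, \gamma_M)$ and $(\alpha_1, \dots,
\alpha_M)$ can be obtained in similar forms:

$$\gamma_i = \frac{(D_{iii}/L_{iii})^{\frac{1}{r-1}}}{\sum_{j=1}^{M}(D_{jjj}/L_{jjj})^{\frac{1}{r-1}}}, \quad i = 1, \dots, M, \tag{21}$$

and

$$\alpha_i = \frac{(\gamma_i/L_{iii})^{\frac{1}{3}}}{\sum_{j=1}^{M}(\gamma_j/L_{jjj})^{\frac{1}{3}}}, \quad i = 1, \dots, M. \tag{22}$$

$$\begin{aligned}
\frac{\mathrm{Tr}(P^TKLKP)}{\mathrm{Tr}(P^TKDKP)}
&= \frac{\mathrm{Tr}\left(P^T(\alpha_1K_1 + \alpha_2K_2)(\alpha_1L_1 + \alpha_2L_2)(\alpha_1K_1 + \alpha_2K_2)P\right)}{\mathrm{Tr}\left(P^T(\alpha_1K_1 + \alpha_2K_2)(\alpha_1D_1 + \alpha_2D_2)(\alpha_1K_1 + \alpha_2K_2)P\right)} \\
&= \frac{\alpha_1^3L_{111} + 2\alpha_1^2\alpha_2L_{121} + \alpha_1\alpha_2^2L_{221} + \alpha_1^2\alpha_2L_{112} + 2\alpha_1\alpha_2^2L_{122} + \alpha_2^3L_{222}}{\alpha_1^3D_{111} + 2\alpha_1^2\alpha_2D_{121} + \alpha_1\alpha_2^2D_{221} + \alpha_1^2\alpha_2D_{112} + 2\alpha_1\alpha_2^2D_{122} + \alpha_2^3D_{222}} \\
&\le \frac{1}{\alpha_1^3L_{111} + 2\alpha_1^2\alpha_2L_{121} + \alpha_1\alpha_2^2L_{221} + \alpha_1^2\alpha_2L_{112} + 2\alpha_1\alpha_2^2L_{122} + \alpha_2^3L_{222}} \\
&\quad \times \left( \frac{(\alpha_1^3L_{111})^2}{\alpha_1^3D_{111}} + \frac{(2\alpha_1^2\alpha_2L_{121})^2}{2\alpha_1^2\alpha_2D_{121}} + \frac{(\alpha_1\alpha_2^2L_{221})^2}{\alpha_1\alpha_2^2D_{221}} + \frac{(\alpha_1^2\alpha_2L_{112})^2}{\alpha_1^2\alpha_2D_{112}} + \frac{(2\alpha_1\alpha_2^2L_{122})^2}{2\alpha_1\alpha_2^2D_{122}} + \frac{(\alpha_2^3L_{222})^2}{\alpha_2^3D_{222}} \right) \\
&= \frac{1}{\alpha_1^3L_{111} + 2\alpha_1^2\alpha_2L_{121} + \alpha_1\alpha_2^2L_{221} + \alpha_1^2\alpha_2L_{112} + 2\alpha_1\alpha_2^2L_{122} + \alpha_2^3L_{222}} \\
&\quad \times \left( \alpha_1^3L_{111}\frac{L_{111}}{D_{111}} + 2\alpha_1^2\alpha_2L_{121}\frac{L_{121}}{D_{121}} + \alpha_1\alpha_2^2L_{221}\frac{L_{221}}{D_{221}} + \alpha_1^2\alpha_2L_{112}\frac{L_{112}}{D_{112}} + 2\alpha_1\alpha_2^2L_{122}\frac{L_{122}}{D_{122}} + \alpha_2^3L_{222}\frac{L_{222}}{D_{222}} \right) \\
&= \sum_{i,j,k \in \{1,2\}} w_{ijk}(\alpha_1, \alpha_2)\frac{L_{ijk}}{D_{ijk}}.
\end{aligned} \tag{8}$$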
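The closed-form weight updates can be sketched in a few lines. The code below assumes the forms gamma_i proportional to (D_iii / L_iii)^(1/(r-1)) and alpha_i proportional to (gamma_i / L_iii)^(1/3), which reduce to the two-view expressions in Eqs. (16) and (18); the function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
def update_gamma(L_diag, D_diag, r):
    """gamma_i proportional to (D_iii / L_iii)^(1/(r-1)), normalized to sum to 1."""
    if r <= 1:
        raise ValueError("r must be greater than 1")
    e = 1.0 / (r - 1.0)
    raw = [(d / l) ** e for l, d in zip(L_diag, D_diag)]
    total = sum(raw)
    return [x / total for x in raw]

def gamma_to_alpha(gamma, L_diag):
    """alpha_i proportional to (gamma_i / L_iii)^(1/3), normalized to sum to 1."""
    raw = [(g / l) ** (1.0 / 3.0) for g, l in zip(gamma, L_diag)]
    total = sum(raw)
    return [x / total for x in raw]

# Made-up values of L_iii = Tr(P^T K_i L_i K_i P) and D_iii for M = 2 views.
L_diag = [0.2, 0.5]
D_diag = [1.0, 1.0]
gamma = update_gamma(L_diag, D_diag, r=3)
alpha = gamma_to_alpha(gamma, L_diag)
```

A view whose ratio L_iii / D_iii is smaller, i.e., whose locality structure is better preserved by the current projection, receives a larger weight; a larger r flattens the weights towards uniform.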


Convergence Although the weight $\alpha$ obtained in the above
procedure is not the global minimum, the objective function is ensured
to stay in a range of small values. We let $F_1$ and $F_2$ be the
objective functions in (6) and (19), respectively, and let

$$F_3 = \sum_{i=j=k} w_{ijk}\frac{L_{ijk}}{D_{ijk}} = \sum_{i=1}^{M} w_{iii}\frac{L_{iii}}{D_{iii}}. \tag{23}$$

We can find that $F_1 \le F_2$ and if there exists $\alpha_i = 1$ for
some $i$, then $F_1 = F_2 = F_3$. During the alternate procedure, for
optimizing $P$, $F_1$ is minimized, and for optimizing $\alpha$, $F_3$
is minimized. Denote $m_1 = \max(F_1 - F_3)$ and $(P_1, \alpha_1) =
\arg\max(F_1 - F_3)$, then we have

$$\min F_3 + m_1 \le F_3(P_1, \alpha_1) + (F_1 - F_3)(P_1, \alpha_1) = F_1(P_1, \alpha_1) \le \max F_1,$$

and we can define the following nonnegative continuous function:

$$F_4(P, \alpha) = \max\left\{F_1(P, \alpha),\ \min_{\alpha}\left(F_3(P, \alpha) + m_1\right)\right\}. \tag{24}$$

Note that $\min_{\alpha}(F_3(P, \alpha) + m_1)$ is independent of
$\alpha$, thus for any $P$, there exists $\alpha_0$ such that $F_1(P,
\alpha_0) = \min_{\alpha}(F_3(P, \alpha) + m_1)$. If we impose the
above alternate optimization on $F_4$, $F_4$ is nonincreasing and
therefore converges. Though $\alpha$ does not converge to a fixed
point, the value of $F_1$ is reduced into a small region, which is
smaller than $\min_{\alpha} F_3$ plus a constant. It is also worthwhile
to note that $F_3$ is actually the weighted sum of the objective
functions for preserving each view’s locality information. However, the
optimization for $F_3$ still learns information from each view
separately, i.e., the locality similarity is not fused. We summarize
the KMP in Algorithm 1.


Algorithm 1 Kernelized Multiview Projection

Input: The training samples $\{S_1, \dots, S_N\}$ and parameter $r > 1$.

Output: The projection matrix $P \in \mathbb{R}^{N \times d}$ and the
weights $\alpha = (\alpha_1, \dots, \alpha_M) \in \mathbb{R}^M$ for
kernel matrices.

1: Extract multiple features from each training image and obtain data
   matrices $X_p^i$, $p = 1, \dots, N$, $i = 1, \dots, M$;
2: Compute the similarity matrices $W_1, \dots, W_M$ and the Laplacian
   matrices $L_1, \dots, L_M$ by solving the optimization problem in
   (1) for each view;
3: Compute the kernel matrices $K_1, \dots, K_M \in
   \mathbb{R}^{N \times N}$ and the Laplacian matrices $L_1, \dots,
   L_M \in \mathbb{R}^{N \times N}$ for $M$ views;
4: Initialize $\alpha \leftarrow (\frac{1}{M}, \dots, \frac{1}{M})$;
5: repeat
6:   Compute the fused kernel matrix $K = \sum_{i=1}^{M} \alpha_iK_i$
     and the fused Laplacian matrix $L = \sum_{i=1}^{M} \alpha_iL_i$;
7:   Compute $P$ by solving the generalized eigenvalue problem (7);
8:   Compute coefficients $\gamma = (\gamma_1, \dots, \gamma_M)$ by
     Eq. (21);
9:   Transform $\gamma$ to $\alpha$ by Eq. (22);
10: until $F_4$ defined in Eq. (24) converges.
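The alternating loop of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: it assumes the per-view matrices K_i, L_i and D_i are already available, replaces the convergence test on F_4 with a fixed iteration count, adds a small ridge to keep K D K positive definite, and in the toy usage substitutes kernel-based graph weights for the sparse-coding similarity. All of these choices, and the name `kmp_fit`, are assumptions.

```python
import numpy as np

def kmp_fit(kernels, laplacians, degrees, d, r=3.0, n_iter=10, reg=1e-8):
    """Alternate between the projection P and the view weights alpha.

    kernels, laplacians, degrees: lists of M arrays of shape (N, N) holding
    K_i, L_i and D_i. The fixed iteration count and the ridge reg are
    assumptions of this sketch.
    """
    M = len(kernels)
    N = kernels[0].shape[0]
    alpha = np.full(M, 1.0 / M)                      # uniform initialization
    for _ in range(n_iter):
        # Fuse the per-view matrices with the current weights.
        K = sum(a * Ki for a, Ki in zip(alpha, kernels))
        L = sum(a * Li for a, Li in zip(alpha, laplacians))
        D = sum(a * Di for a, Di in zip(alpha, degrees))
        # Projection step: K L K p = lambda K D K p via Cholesky whitening.
        A = K @ L @ K
        B = K @ D @ K + reg * np.eye(N)
        C = np.linalg.cholesky((B + B.T) / 2.0)
        Mw = np.linalg.solve(C, np.linalg.solve(C, (A + A.T) / 2.0).T).T
        _, U = np.linalg.eigh((Mw + Mw.T) / 2.0)     # ascending eigenvalues
        P = np.linalg.solve(C.T, U[:, :d])
        # Weight step: closed-form gamma and alpha updates.
        L_diag = np.array([np.trace(P.T @ Ki @ Li @ Ki @ P)
                           for Ki, Li in zip(kernels, laplacians)])
        D_diag = np.array([np.trace(P.T @ Ki @ Di @ Ki @ P)
                           for Ki, Di in zip(kernels, degrees)])
        L_diag = np.maximum(L_diag, 1e-12)           # guard against division by zero
        gamma = (D_diag / L_diag) ** (1.0 / (r - 1.0))
        gamma /= gamma.sum()
        alpha = (gamma / L_diag) ** (1.0 / 3.0)
        alpha /= alpha.sum()
    return P, alpha

# Toy usage with two random views (illustrative only).
rng = np.random.default_rng(0)
views = [rng.normal(size=(8, 4)), rng.normal(size=(8, 6))]
kernels, laplacians, degrees = [], [], []
for X in views:
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Ki = np.exp(-sq / (2.0 * X.shape[1]))   # RBF kernel, toy bandwidth
    Wi = Ki - np.diag(np.diag(Ki))          # stand-in for the sparse graph weights
    Di = np.diag(Wi.sum(axis=1))
    kernels.append(Ki)
    degrees.append(Di)
    laplacians.append(Di - Wi)
P, alpha = kmp_fit(kernels, laplacians, degrees, d=2)
```

As in the listing above, the loop ends with a weight update, so the returned P corresponds to the weights used at the start of the last iteration.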


























Table 1. Dimensions of four features for image classification.

Feature representation                   Dimension
Histogram of oriented gradients (HOG)    225
Local binary pattern (LBP)               256
Color histogram (ColorHist)              192
GIST                                     384
Total dimension                          1057


4. Experiments and Results 

In this section, we evaluate our Kernelized Multiview Projection (KMP)
on three image datasets: CMU PIE, CIFAR10 and SUN397. The CMU PIE face
dataset [8] contains 41,368 images from 68 subjects (people). Following
the settings in [8], we select 11,554 front face images, which are
manually aligned and cropped into 32 × 32 pixels. Further, 7,500 images
are used as the training set and the remaining 4,054 images are used
for testing. The CIFAR10 dataset [22] is a labeled subset of the
80-million tiny images collection. It consists of a total of 60,000
32 × 32 color images in 10 classes. The entire dataset is partitioned
into two parts: a training set with 50,000 samples and a test set with
10,000 samples. The SUN397 dataset [25] contains 108,754 scene images
in total from 397 well-sampled categories with at least 100 images per
category. We randomly select 50 samples from each category to construct
the training set and the remaining samples form the test set. Thus,
there are 19,850 and 88,904 images in the training set and test set,
respectively.

4.1. Compared Methods and Settings 

For image classification, each image can usually be described by
different feature representations, i.e., a multiview representation, in
high-dimensional feature spaces. In this paper, we adopt four different
feature representations: HOG [10], LBP [1], ColorHist and GIST [18] to
describe each image. Table 1 lists the original dimensions of these
features.

We compare our proposed KMP with two related multi-kernel fusion
methods. In particular, the RBF kernels* for each view are adopted in
the proposed KMP method:

$$K_{KMP} = \sum_{i=1}^{M} \alpha_iK_i,$$

where the weight $\alpha_i$ is obtained via alternate optimization.
AM indicates that the kernels are combined by arithmetic mean:

$$K_{AM} = \frac{1}{M}\sum_{i=1}^{M} K_i,$$

*Our approach can work with any legitimate kernel function, though
we focus on the popular RBF kernel in this paper.

Table 2. Performance comparison (%) between the SVM using multiple
features through KMP and the SVM using single original features. The
numbers in parentheses indicate the dimensions of the representations.
For MKL-SVM, the $\ell_1$-graph is also used to construct the kernel
matrix for each view and then MKL-SVM is applied to final
classification.

Dataset          CMU PIE      CIFAR10      SUN397
HOG              83.3         70.2         29.3
LBP              74.6         54.2         20.4
ColorHist        31.2         23.0         9.3
GIST             94.2         82.3         17.5
Concatenation    93.4         82.8         31.9
MKL-SVM          95.6         86.3         30.7
KMP              99.5 (60)    89.7 (80)    40.5 (70)


and GM denotes the combination of kernels through geometric mean:

$$K_{GM} = \left(\prod_{i=1}^{M} K_i\right)^{\frac{1}{M}}.$$
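The two baseline combinations can be sketched as below. The code assumes that both means are taken entrywise over the kernel matrices, which is an interpretation made here for illustration; the function names and the 2 x 2 toy kernels are hypothetical.

```python
import math

def arithmetic_mean_kernel(kernels):
    """Entrywise arithmetic mean of M kernel matrices (lists of rows)."""
    m = len(kernels)
    n = len(kernels[0])
    return [[sum(K[p][q] for K in kernels) / m for q in range(n)]
            for p in range(n)]

def geometric_mean_kernel(kernels):
    """Entrywise geometric mean; assumes nonnegative entries (true for RBF)."""
    m = len(kernels)
    n = len(kernels[0])
    return [[math.prod(K[p][q] for K in kernels) ** (1.0 / m) for q in range(n)]
            for p in range(n)]

# Two toy RBF-like kernel matrices (values invented for illustration).
K1 = [[1.0, 0.25], [0.25, 1.0]]
K2 = [[1.0, 0.64], [0.64, 1.0]]
K_am = arithmetic_mean_kernel([K1, K2])
K_gm = geometric_mean_kernel([K1, K2])
```

Both baselines treat every view equally, whereas the KMP combination lets the weights differ across views.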

Besides, we also include the best performance of the single-view-based
spectral projection (BSP), the average performance of the
single-view-based spectral projection (ASP) and the concatenation of
single-view-based embeddings (CSP) in our comparison. In particular, AM
and GM are incorporated into the proposed KMP framework. BSP, ASP and
CSP are based on the kernelized extension of the Discriminative
Partition Sparsity Analysis (DPSA) [15] technique. In addition, two
nonlinear embedding methods, distributed spectral embedding (DSE) and
multiview spectral embedding (MSE), are adopted in our comparison as
well. In DSE and MSE, the Laplacian eigenmap (LE) [2] is adopted. For
all these compared embedding methods, the RBF-SVM is adopted to
evaluate the final performance.

All of the above methods are then evaluated on seven different code
lengths: {20, 30, 40, 50, 60, 70, 80}. Under the same experimental
setting, all the parameters used in the compared methods have been
strictly chosen according to their original papers. For KMP and MSE,
the optimal balance parameter $r$ for each dataset is selected from
{2, 3, 4, 5, 6, 7, 8, 9, 10} as the value that yields the best
performance by 10-fold cross-validation on the training set. The number
of GMM clusters $G$ in KMP is selected from {10, 20, ..., 100} with a
step of 10 via cross-validation on the training data. The same
procedure is used to select the sparsity hyperparameter $\varepsilon$
from {5, 8, 10, 12, 15, 18, 20}. The best smoothing parameter $\sigma$
in the construction of the RBF kernel and RBF-SVM is also chosen by
cross-validation on the training data. Since the clustering procedure
involves randomness, all experiments are repeated five times and each
of the results in the following section is the average of five runs.
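The hyperparameter selection described above can be sketched as a grid search with k-fold cross-validation. The sketch assumes the three parameters are searched jointly and that a user-supplied callback `evaluate` (hypothetical, not part of the paper) trains on the given indices and returns a validation accuracy; the contiguous fold layout is likewise an assumption.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(n_samples, evaluate, k=10):
    """Return the (r, G, eps) triple with the best mean validation score.

    evaluate(train_idx, val_idx, r, G, eps) is a hypothetical callback that
    trains on train_idx and returns an accuracy measured on val_idx.
    """
    r_values = range(2, 11)                      # {2, ..., 10}
    G_values = range(10, 101, 10)                # {10, 20, ..., 100}
    eps_values = [5, 8, 10, 12, 15, 18, 20]
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for r, G, eps in product(r_values, G_values, eps_values):
        scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            scores.append(evaluate(train_idx, val_idx, r, G, eps))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (r, G, eps), mean_score
    return best_params, best_score
```

The joint grid has 9 x 10 x 7 = 630 combinations, each evaluated on k folds, so searching the parameters one at a time would be a cheaper alternative when training is expensive.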





















Figure 1. Performance comparison (%) of KMP with different multiview embedding methods on the three datasets. 


4.2. Results 

In Table 2, we first report the performance of the original single-view
representations on all three datasets. In detail, we extract the
original feature representations under one certain view and then
directly feed them to the SVM for classification. From the comparison,
we can observe that the GIST features outperform the other descriptors
on the CMU PIE and CIFAR10 datasets, but HOG performs best on the
SUN397 dataset. The lowest accuracy is always obtained by ColorHist.
Furthermore, we also include the long representation, which is the
concatenation of all four original feature representations, in this
comparison. In most cases the concatenated representation reaches
better performance than the single-view representations, but it is
always significantly worse than the proposed KMP. Additionally, the
results of multiple kernel learning based on SVM (MKL-SVM) [11] are
listed in Table 2 using the same four feature descriptors.
Specifically, the best accuracies achieved by KMP are 99.5%, 89.7% and
40.5% on CMU PIE, CIFAR10 and SUN397, respectively.

In Fig. 1, seven different embedding schemes are compared with the
proposed KMP on all three datasets. From the comparison, the proposed
KMP always leads to the best performance for image classification.
Meanwhile, arithmetic mean (AM) and the best performance of the
single-view-based spectral projection (BSP) generally achieve higher
accuracies than geometric mean (GM) and the average performance of the
single-view-based spectral projection (ASP). The concatenation of
single-view-based embeddings (CSP) achieves competitive performance
compared with BSP on all three datasets. DSE always produces worse
performance than MSE and sometimes even obtains lower results than CSP.
However, DSE generates better performance than GM and ASP, since a more
meaningful multiview combination scheme is adopted in DSE. Beyond that,
it can be observed that the final results differ considerably across
target dimensions. Fig. 2 plots the low-dimensional embedding results
obtained by AM, GM, KMP, DSE and MSE on the CIFAR10 dataset. Our
proposed KMP can well separate different categories, since it takes the
semantically meaningful data structure of different views into
consideration for embedding.

Table 3. Performance (%) of KMP with different r values on the
CMU PIE dataset.

In addition, we can observe that, as the dimension increases, all the
curves of the compared methods on the CIFAR10 and SUN397 datasets climb
except for DSE and MSE, both of which decrease slightly on SUN397 when
the dimension exceeds 70. However, on the CMU PIE dataset, the results
climb and then go down for almost every compared method except for DSE
as the dimension increases (see Fig. 1). For instance, the highest
accuracy on the CMU PIE dataset is obtained at a dimension of 60, and
the best performance on CIFAR10 and SUN397 is reached when d = 80 and
d = 70, respectively.

Furthermore, some parameter sensitivity analysis is carried out.
Table 3 illustrates the performance variation of KMP with respect to
the parameter r on the CMU PIE dataset; the target dimensionality d of
the low-dimensional embedding is fixed at each value in
{20, 30, ..., 80} with a step of 10, respectively. By adopting the
10-fold cross-validation scheme on the training data, it is
demonstrated that higher dimensions prefer a larger r in our KMP.
Finally, Fig. 3 shows the variation of parameters G and ε on all three
datasets. The general tendency of these curves is consistently
“rise-then-fall”. It can also be seen from this figure that a larger
training set needs larger values of G and ε, and vice versa.

Figure 2. Illustration of low-dimensional distributions of five
different fusion schemes (illustrated with data of three categories
from the CIFAR10 dataset).

Figure 3. The curves on the left side show the best performance on the
training data when ε is equal to one value from
{5, 8, 10, 12, 15, 18, 20} while G varies its value in
{10, 20, ..., 100}, and vice versa.


4.3. Time Consumption Analysis

In this section, we compare the training and coding time of the
proposed KMP algorithm with other methods. As we can see from Table 4,
our method achieves competitive training time compared with the
state-of-the-art multiview and multiple kernel learning methods. Since
there is no embedding procedure in MKL, the coding time is not
applicable for MKL. Due to the nature of DSE and MSE, they need to be
re-trained when receiving a new test sample. In contrast, once the
projection and weights are obtained by KMP, they are fixed for all test
samples and can be applied quickly. All the experiments are completed
using Matlab 2014a on a workstation configured with an i7 processor and
32GB RAM.


Table 4. Comparison of training and coding time (seconds) for
learning 80 dimensional embedded features on the three datasets.

Dataset    Phase                DSE        MSE        MKL        KMP
CMU PIE    Training time        1148.24    716.79     873.72     755.28
           Coding time/query    1156.01    790.09     n/a        0.032
CIFAR10    Training time        1683.70    1026.32    1098.97    991.54
           Coding time/query    1696.52    1072.18    n/a        0.041
SUN397     Training time        2804.91    1778.74    1678.14    1694.10
           Coding time/query    2812.36    1784.50    n/a        0.036


5. Conclusion 

In this paper, we have presented an effective subspace learning
framework called Kernelized Multiview Projection (KMP). KMP, as an
unsupervised method, can encode a variety of features in different
ways to achieve a semantically meaningful embedding. Specifically, KMP
is able to explore the complementary property of different views and
find a low-dimensional subspace where the distribution of each view is
sufficiently smooth and discriminative. KMP can be regarded as a fused
dimensionality reduction method for multiview data. We have evaluated
our approach on three datasets: CMU PIE, CIFAR10 and SUN397. The
corresponding results have shown the effectiveness and the superiority
of our algorithm compared with other multiview embedding methods. For
future work, we plan to combine the current KMP approach with
semi-supervised learning for other computer vision tasks.